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Abstract 

Prokaryotic genomes are diverse in terms of their nucleotide and oligonucleotide composition as well as 
presence of various sequence features that can affect physical properties of the DNA molecule. We present 
a survey of local sequence patterns which have a potential to promote non-canonical DNA conformations 
(i.e. different from standard B-DNA double helix) and interpret the results in terms of relationships with 
organisms' habitats, phylogenetic classifications, and other characteristics. Our present work differs from 
earlier similar surveys not only by investigating a wider range of sequence patterns in a large number of 
genomes but also by using a more realistic null model to assess significant deviations. Our results show 
that simple sequence repeats and Z-DNA-promoting patterns are generally suppressed in prokaryotic 
genomes, whereas palindromes and inverted repeats are over-represented. Representation of patterns that 
promote Z-DNA and intrinsic DNA curvature increases with increasing optimal growth temperature (OGT), 
and decreases with increasing oxygen requirement. Additionally, representations of close direct repeats, 
palindromes and inverted repeats exhibit clear negative trends with increasing OGT. The observed relation- 
ships with environmental characteristics, particularly OGT, suggest possible evolutionary scenarios of struc- 
tural adaptation of DNA to particular environmental niches. 
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1. Introduction 

Prokaryotic genomes are extremely diverse in terms 
of their nucleotide and oligonucleotide composition 
as well as presence of various forms of sequence 
repeats and patterns that can affect physical properties 
of the DNA molecule. For example, Ussery et al) 
analysed oligopurine/oligopyrimidine runs and alter- 
nating purine/pyrimidine patterns in prokaryotic 
genomesand reported largedifferencesamongdifferent 
organisms. Alternating purine/pyrimidine patterns 
promote Z-DNA conformation, whereas oligopurine/ 
oligopurimidine runs can facilitate formation of A-DNA 
or H-DNA. 2 They found that differences among 
different prokaryotes were related to optimal growth 



temperature (OGT) and pathogenicity, possibly as a 
result of structural adaptations of DNA of these organ- 
isms. 3 In another example, potential G-DNA-forming 
sequences were found to occur at vastly different fre- 
quencies in different prokaryotic genomes. 4 The 
amount of intrinsic DNA curvature in promoter regions 
was found to be related to OGT. 5 Our own analysis of 
more than 1 000 prokaryotic chromosomes suggested 
a large variance amongdifferent genomes in DNAcurva- 
ture-related 10-11 bp sequence periodicity, which 
could reflect differences in chromosomal structure. 6 
The presence of long simple sequence repeats (SSRs) in 
a genome is strongly correlated with the organism's de- 
pendence on a eukaryotic host. 7 These results indicate 
a significant variance of general DNA properties among 
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different prokaryotes, which in some cases appear to be 
related to the organisms' habitats and lifestyles. 

Advancesin DNA sequencingtechnologiesoverthe past 
decades led to a situation where complete genomes are 
available for many microbes about which very little is 
known apart from the information that can be derived 
from the genomic DNA sequence. In particular, despite 
the diversity of DNA properties mentioned above, our 
knowledge about chromosome structure and organiza- 
tion in the cell is limited to studies of a few model organ- 
isms. While it is reasonable to assume that the general 
model of bacterial nucleoid composed of dynamic, super- 
coiled DNA loops stabilized by nucleoid-associated pro- 
teins is probably universal, 8,9 the differences in DNA 
sequence properties suggest that subtle variations may 
exist among bacteria with completely sequenced 
genomes. Such differences in physical characteristics of 
t he ch romosomes cou Id play roles i n the orga n isms' physi- 
ology and adaptations to their particular environments. 

We present a survey of local sequence patterns in pro- 
karyotic genomes with a potential to generate local ir- 
regularities in DNA structure. Our goal was to assess 
diversity of the prokaryotic genomes in terms of ab- 
undance of sequence patterns indicative of possible 
structural transitions in the DNA molecule. While 
physiological functions of sequence patterns promot- 
ing non-canonical DNA conformations and sequence 
patterns promoting conformational transitions in 
DNA are not well understood, there is increasing evi- 
dence that they can play important roles. For example, 
SSRs have been implicated in phase variation, a mech- 
anism that promotes reversible switching of pheno- 
types. 10,11 Palindromes and close inverted repeats 
form stem-loop structures in RNA, which can function 
in transcription terminators and riboswitches, and 
they can promote formation of cruciform structures 
in DNA, which influence replication, regulation of 
gene expression, and recombination. 12 Close direct 
and inverted repeats have mutagenic effects on 
DNA. 1 3-1 6 Specific guanine-rich patterns can promote 
formation of quadruplex G-DNA. 1 7-19 The G-DNA for- 
mation in telomeres is well documented, but G-DNA 
may also play a role in gene expression, replication, 
recombination, and integration of viruses. 1 9-23 
Alternating purine-pyrimidine patterns can under 
favourable conditions promote transitions to the Z- 
DNA conformation, while oligopurine/oligopyrim id ine 
runs promote formation of A-DNAortriple-stranded H- 
DNA. 2,24 ~ 28 Transientformation of left-handed Z-DNA 
can influence transcription and promote genome in- 
stability, 28-31 and indirect evidence suggests similar 
roles for triple-helical H-DNA. 26,32 Intrinsic DNAcurva- 
ture is largely related to periodicspacingsof A-tractsand 
can influence DNA-protein interactions in regulatory 
regions,aid DNA compaction in the nucleoid, and possibly 
promote a particular mode of supercoiling. 5,33,34 



Although the physiological significance of the unusual 
DNA conformations in different organisms is still poorly 
understood, it is reasonable to ask how common the se- 
quence patterns that promote their formation are in dif- 
ferent genomes because their unusually high or low 
occurrence may indicate its functional importance. 

Our present work differs from earlier similar surveys 
notonly by investigating a wider range of sequence pat- 
terns in a large number of genomes but also by using a 
more realistic null model to assess significant anomal- 
ies. Significant deviations from expected pattern occur- 
rences could indicate that the pattern perse is subject to 
selective constraint and therefore functionally signifi- 
cant in a given organism but the pattern over- or 
under-representation could also arise as a tolerated 
artefact of other biases affecting the DNA sequence. 
Our null model reflects the genome-specific nearest 
neighbour preferences, codon biases, and heterogen- 
eity of the DNA sequence; therefore, deviations from 
expected occurrences are more likely to reflect their 
functional significance of the investigated sequence 
patterns. We interpret our results in terms of relation- 
ships with organisms' habitats, phylogenetic classifica- 
tions, and other characteristics. 

2. Materials and methods 

2.1. Data sets 

Complete prokaryotic genomes were downloaded 
from the National Center for Biotechnology Information 
(NCBI) ftp server (ftp://ftp.ncbi.nih.gov/genomes/ 
Bacteria/). Pattern representations were assessed for the 
complete genomes, as well as for 1 Mb segments ran- 
domly selected from each genome. The purpose of the 
latter approach was to reduce the effect of statistical arte- 
facts from comparing sequences of different sizes. The 
results presented in this paperare based on the 1 Mbseg- 
ments, while the data for the complete genomes are 
shown in the Supplementary data (Supplementary 
Tables S5-S7). Genomes smaller than 1 Mb were 
excluded from the analysis. Our final data set included 
1424 complete genomes of 941 species, 519 genera, 
and 37 phyla or subphyla (Supplementary Table S1). 
We used the existing annotation (the CDS' keywords) to 
differentiate protein-coding and non-coding sequences. 

The OGTand oxygen requirement classifications for 
each genome were obtained from the Genomes Online 
Database (http://www.genomesonline.org). 35 Among 
the 1 424 complete genomes, 1 378 genomes have the 
OGT classification available and 1304 genomes have 
the oxygen requirement classification available. To elim- 
inate sampling biases towards genera represented by 
many completely sequenced genomes, we performed 
statistical assessments at the level of genera rather than 
individual genomes. Accordingly, we apply a single 
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classification to each genus, which is determined by the 
majority of species within the genus. This procedure 
yielded OGT classification for 498 genera (including 1 8 
psychrophiles and psych rotolerant organisms, 382 
mesophiles, 72 thermophiles, and 26 hyperthermo- 
philes) and oxygen requirement classification for 465 
genera (159 anaerobes, 202 aerobes, 95 facultative, 
and 9 microaerophilic organisms). 

2.2. Patterns of interest 

Table 1 provides a list of sequence patterns whose 
occurrences in prokaryotic genomes were investigated 
in this work. The patterns were selected based on the 
available information about sequences that promote 
various forms of structural transitions or mutations in 
DNA under favourable circumstances. However, 
because exact rules governing structural transitions in 
DNA are not fully understood, we use multiple forms 
of similar patterns, in most cases pertaining to varying 
length of the pattern and number of tolerated mis- 
matches. The specific parameters used in the definition 
of the patterns were in part dictated by the size of the 
analysed genomes in order to obtain sufficient data 
sample for statistical evaluations. 

It is worthwhile to note that we are not aiming to 
predict accurately every site in the genome that is 
likely to undergo a conformational transition underfa- 
vourable conditions. We are only asking how common 
are sequences favouring certain structural transitions 
in each genome and whether their frequency could 
be indicative of selective constraints acting on such 
sequences. At the same time, the extensive simulations 
with randomized genomes described below require 
that the patterns are sufficiently simple that they can 
be identified quickly. Given this limitation, we selected 
the sequence patterns to be approximately representa- 
tive of the sequences known to undergo the specific 
structural transitions under favourable conditions. 

2.3. Evaluation of pattern representations 

Pattern Locator 36 was used to count loci in each ana- 
lysed sequence that matched the sequence pattern at 
hand (the observed count). The expected count of 
matching loci was determined as the average number 
of matching loci in 2 0 randomized sequences. The ran- 
domized sequences were generated by the 'mid' 
model of the Genome Randomizer software previously 
developed in our laboratory 37 (available for download 
at http://www.cmbl.uga.edu/software.html). In this 
model, the analysed sequence is first divided into seg- 
ments corresponding to annotated protein-coding 
genes and intergenic regions. For each intergenic 
segment, a random sequence of the same length is gen- 
erated as the first-order Markov chain, thus preserving 
the nucleotide and dinucleotide composition of that 



specific intergenic segment. Analogously, a random se- 
quence is generated for each gene as a first-order 
Markov chain using the codon alphabet. The final ran- 
domized genome is constructed by reassembling the 
randomized genesand intergenic segments in theirori- 
ginal order. The resulting randomized sequence mimics 
not only the overall nucleotide, dinucleotide, and 
codon frequencies of the complete genome but rather 
of each individual gene and intergenic region. Thus, 
the null model incorporates the compositional hetero- 
geneity of the genomic sequence at the scale of individ- 
ual genes. Because we factor the codon usage biases, 
dinucleotide preferences, and differences between 
protein-coding and non-coding sequences into the 
null model, we are more likely to detect anomalous 
usage of sequence patterns that arises from direct selec- 
tion on the sequence patterns as opposed to anomalous 
usage that is a simple consequence of codon or di- 
nucleotide biases characteristic of the particular 
genome or its various segments. As a result, our assess- 
ments are more likely to reflect selective constraints 
and functional significance of the analysed sequence 
pattern than assessments utilizing a simpler null model. 

The pattern representations are classified into nine 
levels, from -4 (extremely under-represented), through 
0 (normally represented) to +4 (extremely over- 
represented) based on the P-value and the observed/ 
expected ratio (Supplementary Table S8). We use the 
combination of the two criteria because the observed/ 
expected ratio is independent of the sample size and 
measures the deviation from the null model in more ab- 
solute terms but does not directly provide assessment 
of statistical significance. Using P-value alone could em- 
phasize small deviations in large samples (e.g. large 
genomes or more frequent patterns), whereas using 
the observed /expected ratio alone could lead to 
over-interpretation of results that lack statistical signifi- 
cance. Assigning representation categories based on 
both criteria is a compromise designed to avoid these 
potential pitfalls. The P-values are assessed based on 
an assumption that the counts of matching loci follow 
the Poisson distribution. This is a reasonable approxi- 
mation for long sequences and patterns that occur at 
low frequencies. 

Statistical assessments could be skewed by inclusion 
of observations that are not independent, such as 
several closely related genomes (e.g. different strains 
of the same species or closely related species), which 
are likely to feature similar levels of representations of 
various sequence patterns simply as a result of insuffi- 
cient evolutionary divergence. To reduce the effect of 
dependent observations resulting from insufficient di- 
vergence, we combined the results for all genomes of 
the same genus into a single 'observation' by averaging 
pattern representation categories for all available 
genomes belonging to that genus. 
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Pattern 


Code 


Meaning 


Example 3 


Simple sequence 


1n8 


A single nucleotide repeated 8+ times in a row 


GCAAAAAAAAATA 


repeats 


2n5 


A dinucleotide repeated 5+ times in a row 


ACCACACACACATA 




3n4 


Analogous to the two examples above 






And. 
H I 14- 








5n4 








D I 1 J 








/no 








olio 








9n2 








10n2 








1 1n2 






L^lLoC U 1 1 CLl 


*T I IDg 1 Z 


r\ LtrLi d 1 1 ULItr(JLIUtr itrJJtrdLcU D-|- Lllllcb W1LI1 IddUb _^ I Z 


lr\v^^,r\l Lj^, 1 L_L_rtl lrv^k_rtlr\LjV^L_rtl . . . 


repeats 




1 1 L 






O 1 1 OgZ 


A 6-mer repeated 6+ times with gaps < 24 nt 






Qn/1 rrl A 

on^+gz^- 


An 8-mer repeated 4+ times with gaps < 24 nt 






LU otu 


r\i 1 O illcl IcUtrdLtrU W1LII1II D IIL 


CTTA CCC ATI" AITTTAP CC A 
^ 1 1 j * — r\ 1 L_M^L_ 1 lAUUCrt 






A 1 fl-mpr rpnpafpH within R 0 nt 
r\ I U II1CI 1 CJJCdLCU W 1 LI III L jU IIL 




Palindromes and 


cp8g6 


Inverted repeat or an 8-mer separated by no more 


CTTAGGCATCACTGCCTAAG 


inverted repeats 




LI Id [I O ML 






Lp I UgDU 


1 1 IVC I LCU [cfjed L Ul d I U Illcl Scpd I d LCU Uy ^ JU IIL 






pa ls9 


9 nt inverted repeat (no separation) OR 1 2 nt 


1 LjLjrtl L~MLjLjL- 1 f\f\f\. 1 1 L— MO^L- 1 L~MlL~L~/\Lj 






inverted repeat allowing 1 mismatch OR 1 5 nt 








inverted repeat allowing 2 mismatches OR . . . (one 








1 1 1 Ibl 1 Id LCI 1 dUULU 1 Ul CVL y O IIL 1 C 1 1 ii L 1 1 J 






m lr Q ft 1 0 
pdlS:/g I Z 


Like pals9 but allowing separation up to 1 2 bp 






pais i zgzu 


Analogous to the example above 




H-DNA-related 


cm8g6 


Mirror repeat of an 8-mer separated by <6 nt 


CTTAGGCATCACACGGATTC 


patterns 


cm1 0g50 


Mirror repeat of a 1 0-mer separated by <50 nt 






mirs9 


9 nt mirror repeat (no separation) OR 1 2 nt mirror 


CTGG ATCAGGCTAAA: AACTCGGACTG GGTC 






repeat allowing 1 mismatch OR 1 5 nt mirror repeat 








allowing 2 mismatches OR . . . (one mismatch added 








for every 3 nt length) 






iilliajg I Z 


Like mirs9 but allowing separation up to 1 2 bp 






m i rc 1 7 cr 9 0 
I I 1 1 1 b I Z eZ u 


Analnoni ic tn mircQol ~) 
rtl 1 d lUgU Lib LU 1 1 1 1 1 b J ii 1 Z 






Die; 


Run of ^15 purines or pyrimidines 


A A C CC A CCC A C C A C A 
r\r\\^ Lj Lj MLj Lj Lj /\Lj Lj MLj r\ 




R30 


Run of >30 purines or pyrimidines 






R30e3 


Run of >30 purines or pyrimidines allowing <3 








errors 






R45e6 


Analogous to the example above 






R60e9 








CC Hod. 


8 or more GG dimers separated by ^4 nt trom each 


cc accctcccccccccctcccc 
LjLjALjLjL, 1 LjLjL^LjLjLjLj^LjLj 1 LjLjLjLj 


patterns 




other 






GGG4g6 


Analogous to the example above 






GGGG4g6 






Z-DNA-related 


GC6 


Alternating G-C, >6 nt length 


GCGCGC 


patterns 


GC8 


Alternating G-C, >8 nt length 






RY1 2 


Alternating R-Y, > 1 2 nt length 


TGTACGTGTGCA 




RY1 2e1 


Like RY1 2 but allowing 1 error 


TGTACG A GTGCA 




RY1 8e2 


Alternating R-Y, > 1 8 nt length, <2 errors 






RY24e3 


Alternating R-Y, >24 nt length, <3 errors 




DNA bending 


bend45w60 


Predicted bend of >45° within a < 60 bp segment 






bend60w100 


Predicted bend of >60° within a < 1 00 bp segment 






bend90w1 20 


Predicted bend of >90" within a < 1 20 bp segment 





a Segments matching the sequence pattern are underscored, mismatches a re shaded, and symmetrical segments are separated by 
a vertical dashed line. 



3. Results 

3.1. Simple sequence repeats 

A strong avoidance of tandem repeats of mono-, di-, 
and tri-nucleotides spreads over all phyla, especially in 
the whole genome and protein-coding regions, while 



tandem repeats of longer oligomers are generally nor- 
mally represented and sometimes weakly over-repre- 
sented (Table 2 and Supplementary Table S9). 
However, tandem repeats of mono-,di-,and tri-nucleo- 
tides are generally less suppressed in the intergenic 
regions when compared with protein-coding regions 



Table 2. Representation of sequence patterns in different phyla 



Pattern name 


Pattern code 


AlPr 


BePr 


CaPr 


DcPr 


EpPr 


Firm 


Acti 


Cyan 


Baa 


Chlb 


ChK 


Dein 


Fuso 


Cilia 


Spir 


Acid 


Verr 


Defe 


Plan 


Aqui 


Tber 


Eury 


Cren 






58 


3 9 


76 


23 


1 1 


6 1 


63 


1 2 


34 


5 


8 


6 


5 


6 


4 


4 


4 


4 


4 


8 


5 


44 


1 G 


Simple sequence repeats 


1 nS 


-4.00 


-3.00 


-4.00 


-4.00 


-4.00 


-4.00 


-4.00 


-4.00 


-4.00 


-1.67 


-4.00 


-4.00 


-4.00 


-3.79 


-3.75 


-3.00 


-3.00 


-4.00 


-3.50 


-4.00 


-4.00 


-4.00 


-4.00 




2n5 


-2.21 


-2.00 


-3.00 


-3.00 


-4.00 


-4.00 


-3.00 


-3.63 


-4.00 


-3.00 


-3.00 


-2.60 


-4.00 


-4.00 


-3.50 


-2.00 


-2.50 


-4.00 


-2.00 


-4.00 


-4.00 


-3.00 


-3.00 






















































4 n4 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


-1.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




5n4 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.07 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




6n3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.07 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




7n3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.75 


0.00 


1 .00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.50 


1 .00 


2.00 


0.00 


0.00 


0.00 


0.00 




8n3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.50 


0.00 


0.00 


0.50 


0.00 


0.00 


0.00 


0.00 


0.00 




9 n2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


-0.36 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




10n2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.17 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 


0.00 


0.00 




1 1 n2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.59 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


Close direct repeats 


4n6gl2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.13 


0.00 


0.00 


0.00 


1.25 


0.00 


0.00 


0.00 


0.00 




6n6g24 


0.42 


0.25 


0.88 


1.00 


0.00 


0.30 


2.00 


1 .50 


1.00 


1.00 


1 .00 


0.00 


0.00 


0.00 


0.98 


1.50 


0.00 


1 .50 


1.25 


0.00 


0.00 


0.10 


0.00 




8n4g24 


0.50 


1.00 


1.14 


2.00 


0.00 


1 .00 


3.00 


3.00 


1.50 


2.00 


1 .00 


0.00 


0.00 


0.00 


1.25 


2.25 


0.50 


1 .50 


2.75 


0.00 


0.00 


1 .00 


0.00 




Cd8g6 


0.00 


0.00 


0.00 


1.00 


1. 00 


0.00 


2.00 


1 .92 


0.00 


0.00 


0.75 


1. 20 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 


0.60 


1.00 




Cdl0g50 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


1.00 


0.60 


0.00 


0.00 


0.00 


0.00 


0.00 


-0.1 9 


0.1 5 


0.00 


0.00 


-1.00 


1.00 


0.00 


0.00 


0.00 


0.00 


Palindromes and inverted 


cp8g6 


3.00 


3.00 


4.00 


3.50 


3.00 


4.00 


3.00 


1 .81 


3.00 


2.00 


4.00 


2.63 


3.00 


4.00 


2.86 


3.50 


4.00 


3.00 


1.50 


2.00 


4.00 


2.00 


2.00 


repeats 


cpl0g50 


3.00 


4.00 


4.00 


4.00 


3.83 


4.00 


4.00 


3.54 


4.00 


4.00 


4.00 


3.75 


4.00 


4.00 


3.69 


3.50 


2.00 


3.00 


3.50 


1 .00 


4.00 


1 .00 


1.00 




pals9 


1 ,09 


2.00 


3.35 


3.00 


2.00 


4.00 


4.00 


2.00 


4.00 


3.00 


3.50 


3.00 


3.00 


3.96 


2.17 


1.50 


2.00 


1 .50 


1.00 


0.00 


3.50 


0. 10 


0.00 




pa[s9gl 2 


3.66 


4.00 


4.00 


4.00 


3.72 


4.00 


4.00 


3.71 


4.00 


4.00 


4.00 


3.75 


4.00 


4.00 


3.77 


4.00 


4.00 


3.00 


4.00 


1 .00 


4.00 


2.00 


1.00 




palsl2g20 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


4.00 


3.77 


4.00 


4.00 


3.00 


4.00 


0.00 


4.00 


1 .00 


0.22 


H-DNA-related patterns 


cm8g6 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.34 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




cm 1 0g50 


















































mirs9 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




mirs9gl 2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.25 


0.17 


0.00 


0.00 


0.34 


0.00 


0.00 


0.00 


0.25 


0.50 


0.00 


0.00 


0.25 


0.00 


0.00 


0.00 


0.00 




mirsl 2g20 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.09 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0. 13 


0.00 


0.50 


0.00 


0.25 


0.00 


0.00 


0.00 


0.00 




Rl 5 


-0.20 


0.00 


0.00 


0.00 


-1.00 


-0.20 


0.00 


-1.07 


0.00 


0.00 


0.25 


-1.00 


-4.00 


0.00 


-0.90 


-0.50 


1 .00 


-3.00 


0.00 


-2.00 


-3.00 


-1.00 


-1.00 




R30 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.38 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




R30e3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


-2.00 


-0.25 


-0.1 3 


0.00 


0.50 


-1.00 


0.00 


-0.50 


-1.50 


0.00 


0.00 




R45e6 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


-2.00 


0.00 


0.1 5 


0.00 


0.50 


0.00 


0.50 


-0.50 


0.00 


0.00 


0.00 




R60e9 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.07 


0.00 


0.00 


0.00 


0.50 


0.00 


0.00 


0.00 


0.00 


G-DNA-related patterns 


CC8g4 


1.00 


1.33 


0.00 


0.00 


0.00 


0.00 


1.00 


1 .00 


0.00 


0.00 


0.00 


2.70 


0.00 


0.00 


0.54 


1.00 


0.00 


0.00 


1.50 


0.00 


0.00 


0.00 


0.00 




CCC4g6 


0.00 


0.00 


0.00 


0.00 


0.00 


-0.63 


0.00 


0.00 


0.00 


0.00 


-0.31 


1 .00 


0.00 


0.00 


-0.25 


0.00 


0.00 


0.00 


0.00 


-0.50 


0.00 


-0.42 


0.00 




CCCC4g6 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.1 7 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


Z-DNA-related patterns 


GC6 


-1.00 


-1.00 


-2.00 


-2.00 


-1.00 


-1.00 


-2.00 


-2.00 


-1.00 


-4.00 


-2.00 


-1.20 


0.00 


0.00 


-0.77 


-1.50 


-1.00 


-0.50 


-2.75 


-0.25 


0.00 


-2.00 


-1.00 




GC8 


-0.50 


-1.00 


-1.00 


-1.00 


0.00 


0.00 


-3.00 


-0.25 


0.00 


-2.00 


-1.34 


-1.25 


0.00 


0.00 


0.00 


-0.50 


0.00 


0.00 


-3.00 


0.00 


0.00 


0.00 


0.00 




RY12 


-2.50 


-2.00 


-3.00 


-1.00 


-0.89 


-0.33 


-2.00 


-2.00 


-0.35 


-2.00 


-3.25 


-0.25 


0.00 


0.42 


-0.09 


-0.75 


0.00 


-1.00 


-3.00 


0.00 


0.00 


-0.72 


0.00 




RY1 2el 


-2.00 


-1.40 


-3.00 


-1.00 


-1.00 


-0.25 


-2.00 


-3.00 


0.00 


-3.00 


-2.25 


0.00 


0.00 


0.84 


0.05 


-0.75 


0.00 


-1.00 


-3.00 


0.50 


0.00 


-0.64 


-0.98 




RV1 8e2 


-2.65 


-2.00 


-2.53 


-1.00 


0.00 


0.00 


-2.00 


-2.00 


0.00 


-2.00 


-3.00 


-1.00 


0.00 


0.75 


0.00 


-0.75 


0.00 


0.00 


-3.50 


0.00 


0.00 


0.00 


0.00 




RY24e3 


-1.03 


-1.50 


-1.00 


0.00 


0.00 


0.00 


-1.00 


-1.00 


0.00 


-1.00 


-1.25 


0.00 


0.00 


0.00 


-0.1 7 


-1.00 


0.00 


0.00 


-1.00 


0.00 


0.00 


0.00 


0.00 


DNA bending 


bend45w60 


0.2S 


0.00 


0.07 


1.00 


3.00 


2.00 


0.00 


1 .50 


2.00 


1.00 


0.17 


0.4 5 


2.00 


1.50 


0.5 0 


0.00 


0.00 


2.50 


0.00 


2.00 


2.00 


2.00 


0.00 




bend60wl 00 


0.00 


0.00 


0.00 


0.50 


2.97 


2.00 


0.00 


1 .00 


2.00 


0.00 


0.00 


0.00 


2.00 


0.75 


0.25 


0.00 


0.00 


3.00 


0.00 


2.25 


3.00 


2.00 


0.00 




bend90w!20 


0.00 


0.00 


0.00 


0.00 


3.00 


1 .00 


0.00 


1 .00 


1.00 


0.00 


0.00 


0.00 


3.00 


0.00 


0.20 


0.00 


0.00 


3.50 


0.00 


2.00 


3.00 


1 .00 


0.00 



p 



Numbers in the table refer to the medians of pattern representation among all genera of the corresponding phylum. The pattern representations were categorized into 
nine categories from -4 (extremely under-represented), through 0 (normally represented), to +4 (extremely over-represented). See Materials and methods for details. 
A colour version of this table is presented in Supplementary Materia Is as Table S8. Codes in the second column refer to specific sequence patterns (see Table 1 ). Columns 
represent different phyla abbreviated as follows: AlPr, a-proteobacteria; BePr, /3-proteobacteria; GaPr, y-proteobacteria; DePr, S-proteobacteria; EpPr, e-proteobacteria; 
Firm, Firmicutes; Acti, Actinobacteria; Cyan, Cyanobacteria; Bact, Bacteroidetes; Chlb, Chlorobi; Chlf, Chloroflexi; Dein, Deinococcus-Thermus; Fuso, Fusobacteria; Chla, 
Chlamydiae; Spir, Spirochaetes; Acid, Acidobacteria; Verr, Verrucomicrobia; Defe, Deferribacteres; Plan, Planctomycetes; Aqui, Aquificales; Ther, Thermotogae; Eury, 
Euryarchaeota; Cren, Crenarchaeota. Numbers in the second row indicate the number of genera available for each phylum. Only phyla represented by three or more 
genera are shown. 
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(Supplementary Tables S1 0 and S1 1). Trichodesmium 
erythraeum and Methanosphaera stadtmanae are extreme 
exampleswith strongly over- re presented mono-nucleo- 
tide repeats 1n8 (a single nucleotide repeated >8 
times) in the intergenic regions (P< 1 0~ 12 ), but nor- 
mally represented in the whole genome and extremely 
under-represented in the protein-coding regions. Most 
Chlorobi (6 species out of 1 0) also tend to have over- 
represented 1n8 pattern in the intergenic regions. 
Tandem repeats of longer oligonucleotides (6-1 1 bp) 
are extremely over- re presented in intergenic regions 
as well as complete genomes of Methanococcus voltae, 
Methanococcus aeolicus, and Natrialba magadii (and to 
a lesser extent in several other genomes) but not in 
the protein-coding segments (Supplementary Tables 
S2-S4). 

Mono- and dinucleotide repeats are about equally 
suppressed in psychrophiles, mesophiles,thermophiles, 
and hyperthermophiles (Table 3 and Supplementary 
Table S1 2). However, the suppression of trinucleotide 
tandem repeats is more pronounced in thermophiles 
and hyperthermophiles than in mesophiles and psy- 
chrophiles (significant at P<10~ 5 ; the P-value is 
based on Fisher's exact test for 2-by-2 contingency 
tables). Oxygen requirement appears to have no rela- 
tionship to tandem repeat representations. 

3.2. Close direct repeats 

In general,close direct repeats are normally represented 
orslightly over-represented incomplete genomes of most 
prokaryotes and in the intergenic regions, and they are 
mostly normally represented in genes (Supplementary 
Tables S9-S1 1). The cd10g50 repeats (two copies of 
the same 1 0-mer separated by <50 nucleotides) 
exhibit the strongest trends among the investigated 
direct re peat structures. Theyare mostly over-represented 
in the intergenic regions, weakly under-represented in 
protein-coding regions, and normally represented in the 
whole genome. It is interesting to note the contrast in 
representations of close pairs of 1 0 bp repeats 
(cd1 0g50) and multiple close copies of shorter oligonu- 
cleotides (4n6g12, 6n6g24, 8n4g24). While the 
former are often strongly over-represented in intergenic 
regions and normally represented to weakly under-repre- 
sented in genes, the latter are normally represented or 
moderately over-represented in both genes and inter- 
genic regions (Supplementary Tables S1 0 and S1 1 ). 

Representations of close direct repeats exhibit clear 
negative trends with increasing OGT (Table 3 and 
Supplementary Table S1 2). For example, clustered 
repeats 8n4g24 (at least four copies of the same 
octamer separated by gaps of no more than 24 bp) are 
over-represented in ~50% of psychrophiles and meso- 
philes but in few thermophiles and virtually no hyperther- 
mophiles (Fig. 1 ; the difference is significant at P < 1 0~ 6 ). 



A significant decrease in pattern representations with in- 
creasing OGT is seen also for patterns 6n6g24 
and8n4g24, indicating a general trend of lower local re- 
petitiveness in thermophiles and hyperthermophiles. 
There is no clear relationship between the representation 
of close direct repeats and oxygen requirement. 

3.3. Palindromes and inverted repeats 
Palindromes and inverted repeats are strongly over- 
represented in the complete genomes across almost all 
bacterial phyla but less so in archaea and Aquificales 
(Table 2 and Supplementary Table S9). There are few 
organisms that strongly avoid palindromes in protein- 
coding regions and the complete genomes, including 
cyanobacteria Synechosystis and Thermosynechococcus 
(Supplementary Tables S2 and S3). Representations of 
palindromes and close inverted repeats in protein-coding 
sequences vary significantly among different genera, 
from extremely over-represented (e.g. Cyanobacterium, 
Thermoanaerobacter, Allivibrio, Azobacteroides) to strongly 
suppressed (e.g. Rhodomicrobium, Pbenylobacterium, 
Jannaschia, Clavibacter). Many a-proteobacteria exhibit 
some level of palindrome suppression. Clavibacter is an 
outlier among Actinobacteria, which generally tend to 
have palindromes normally represented and in some 
cases over-represented (Supplementary Table S3). 

A comparison between genes and intergenic regions 
shows that inverted repeats are almost invariably over- 
represented in the intergenic regions and generally 
normally represented in the protein-coding regions 
(Supplementary Tables S3 and S4). The high concentra- 
tion of palindromes and inverted repeats in the inter- 
genic region is not necessarily surprising because of 
theirfunction in transcription termination and as regu- 
latory elements, which are generally located outside 
the protein-coding segments of genes. 

There is a general trend of decreasing representations 
of palindromes and inverted repeats with increasing 
OGT (Fig. 1 , Table 3, and Supplementary Table S1 2). 
In contrast, there is only a weak or no relationship 
between representations of palindromes and oxygen 
requirement. 

3.4. Oligopurine I oliqopyrimidine runs and triplex 
DNA-promoting patterns 

H-DNA triplexes form underfavourable conditions in 
oligopurine/oligopyrimidine sequences with a mirror 
symmetry, 24,26 whereas some oligopurine/oligopyri- 
midine segments can also promote formation of A- 
like DNA. 2 Mirror repeats and extended oligopurine/ 
oligopyrimidine stretches are generally normally repre- 
sented incomplete genomes of most phyla (Table 2 and 
Supplementary Table S9), with moderate over-repre- 
sentations in some Cyanobacteria (Anabaena, Nostoc, 
Microcystis, Trichodesmium), Chloroflexi (Chloroflexus 



Table 3. Representation of sequence patterns in different OGTand oxygen requirement classes 



Pattern name 


Pattern code 


Psychrophile 
1 8 


Mesophile 
382 


Thermophile 
72 


Hyperthermophile 
26 


Anaerobe 
1 59 


Aerobe 
202 


Facultative 
95 


Microaerophile 
9 


— j 

Simple sequence repeats 


1 n8 


— 3.83 


— 3.54 


— 3.89 


— 4.00 


— 3.73 


— 3.61 


— 3.65 


— 3.67 




2n5 


- 3.1 5 


— 2.89 


— 3.40 


-3.45 


- 3.30 


— 2.74 


— 3.02 


- 3.1 1 




3n4 


— 1 .93 


— 1 .92 


— 2.83 


-3.09 


- 2.56 


— 1.75 


- 2.1 5 


-2.33 




4n4 


-0.06 


0.01 


-0.06 


-0.08 


-0.04 


0.01 


0.00 


-0.10 




5n4 


0.00 


0.07 


0.00 


0.00 


0.02 


0.10 


0.01 


0.1 1 




6n3 


0.20 


0.19 


0.06 


-0.03 


0.05 


0.25 


0.08 


0.31 




7n3 


1 .27 


0.48 


0.45 


0.00 


0.42 


0.59 


0.35 


0.89 




8n3 


0.56 


0.30 


0.1 2 


0.01 


0.25 


0.27 


0.20 


0.33 




9n2 


-0.08 


0.09 


0.22 


0.04 


0.09 


0.19 


-0.1 1 


-0.1 1 




1 0n2 


0.09 


0.25 


0.25 


0.28 


0.27 


0.24 


0.1 6 


0.00 




1 1 n2 


0.21 


0.32 


0.30 


0.40 


0.42 


0.26 


0.24 


0.03 


Close direct repeats 


4n6g1 2 


0.45 


0.58 


0.48 


0.82 


0.55 


0.73 


0.31 


0.50 




6n6g24 


1.35 


1.1 6 


0.71 


0.34 


1.00 


1.21 


0.81 


1.39 




8n4g24 


2.02 


1.61 


0.80 


0.28 


1.27 


1.69 


1.18 


1.13 




cd8g6 


0.39 


0.73 


0.80 


1 .20 


0.87 


0.82 


0.41 


0.70 




cd1 0g50 


0.46 


0.41 


0.05 


-0.07 


0.25 


0.60 


0.01 


0.07 


Palindromes and inverted repeats 


cp8g6 


3.57 


2.78 


2.81 


1.99 


2.98 


2.39 


3.21 


3.22 




cp1 0g50 


3.56 


3.26 


2.98 


1.64 


3.1 5 


3.03 


3.45 


3.20 




pals9 


3.39 


2.58 


2.1 3 


0.71 


2.47 


2.27 


2.89 


2.47 




pals9g1 2 


3.83 


3.46 


3.05 


1.79 


3.33 


3.22 


3.55 


3.64 




pals! 2g20 


3.72 


3.58 


2.97 


1 .1 7 


3.1 6 


3.47 


3.62 


3.00 


H-DNA-related patterns 


cm8g6 


0.02 


0.20 


0.2 1 


0.33 


0.24 


0.24 


0.1 5 


0.04 




cm 1 0p50 


0.02 


0.1 6 


0.1 0 


0.1 5 


0.1 6 


0.1 8 


0.1 0 


0.00 




mirs9 


0.00 


0.1 3 


0.19 


0.1 7 


0.14 


0.1 8 


0.08 


-0.02 




mirs9g1 2 


0.1 5 


0.30 


0.35 


0.44 


0.33 


0.38 


0.20 


0.00 




mirsl 2g20 


0.1 5 


0.27 


0.28 


0.20 


0.26 


0.33 


0.16 


0.22 




R1 5 


-0.47 


-0.52 


-0.96 


-1.77 


-0.91 


-0.55 


-0.34 


-1.78 




R30 


0.03 


0.07 


0.08 


0.1 1 


0.1 0 


0.04 


0.07 


0.00 




R30e3 


0.08 


-0.01 


-0.1 8 


-0.66 


-0.23 


0.00 


0.1 4 


-0.66 




R45e6 


0.1 7 


0.08 


-0.03 


-0.32 


-0.06 


0.1 0 


0.1 5 


-0.02 




R60e9 


0.1 5 


0.1 3 


0.1 1 


-0.03 


0.1 4 


0.1 0 


0.1 2 


0.00 


G-DNA-related patterns 


GG8g4 


0.22 


0.72 


0.24 


0.14 


0.20 


1.03 


0.48 


0.44 




GGG4g6 


-0.08 


-0.26 


-0.78 


-0.65 


-0.67 


-0.1 8 


-0.26 


-0.22 




GGGG4g6 


0.00 


0.1 0 


-0.06 


-0.1 2 


0.01 


0.1 1 


0.02 


0.1 1 


Z-DNA-related patterns 


GC6 


-1.37 


-1.50 


-1.35 


-0.95 


-1.41 


-1.38 


-1.71 


-1.42 




GC8 


-0.75 


-1.10 


-0.84 


-0.1 8 


-0.47 


-1.42 


-1.13 


-0.37 




RY1 2 


-1.79 


-1.58 


-0.88 


-0.16 


-0.78 


-1.73 


-2.03 


-0.99 




RY1 2e1 


-1.81 


-1.40 


-0.71 


-0.31 


-0.85 


-1.43 


-1.93 


-0.44 




RY1 8e2 


-1.66 


-1.45 


-0.72 


-0.04 


-0.58 


-1.60 


-2.07 


-0.37 




RY24e3 


-0.60 


-0.78 


-0.25 


0.05 


-0.30 


-0.79 


-1.12 


-0.44 


DNA Bending 


bend45w60 


0.60 


0.93 


1.51 


1 .1 1 


1.57 


0.59 


0.85 


1.70 




bend60w1 00 


0.32 


0.79 


1.40 


1 .1 7 


1.42 


0.50 


0.74 


1.48 




bend90w1 20 


0.28 


0.58 


1.23 


0.94 


1.1 1 


0.34 


0.66 


1.44 



Numbers in the table refer to the average significance category for all genera within each class of organisms. Numbers below the class description indicate numbers of 
available genera of each class. Anaerobe includes both obligate anaerobes and anaerobes; Aerobe includes both obligate aerobes and aerobes. See Supplementary Table 
S5 for coloured version of this table. 
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Figure 1. Comparison of representations of selected patterns in different OCT classes. Bars show the percentage of species in each OGT class 
which have the given pattern under- represented, normally represented, or over-represented. The pattern is considered over- represented if 
the P-value is <1 0~ 4 and observed to expected ratio >1 .1 0 (representation level 2 or higher) for majority of the complete genomes 
available for that genera, it is deemed under-represented if the P-value is <1 0~ 4 and observed to expected ratio <0.91 , and normally 
represented otherwise. See Materials and methods and Supplementary Table S8. The four patterns for which the data are shown are 
representative of close repeat structures (8n4g24, top left), palindromes and close inverted repeats (pals9g1 2, top right), potential Z- 
DNA-promoting patterns (RY1 2, bottom left), and DNA bending pattrens (bend60w1 00, bottom right). See Table 1 for description of 
the pattern codes. 



and Roseiflexus), Planctomycetes (Isosphaera), and 
Verrucomicrobia (Methylacidiphilum) (Supplementary 
Table S2). However, oligopurine and oligopyrimidine 
stretches, mainly patterns R1 5 and R30e3, are sup- 
pressed in complete genomes of some phyla. Separate 
analysis of protein-coding and non-coding regions 
shows that oligopurine/oligopyrimidine stretches 
tend to be normally represented in intergenic regions 
and that anomalous representations of these patterns 
in some phyla arise from biases in protein-coding seg- 
ments (Supplementary Tables S1 0 and S1 1 ). 

Suppression of oligopurine/oligopyrimidine stretches, 
especially R1 5, is most common among hyperthermo- 
philes, whereas mesophiles and psychrophiles exhibit 
normal representations of these patterns (Table 3 and 
Supplementary Table S1 2). The R1 5 pattern is also 
more likely to be under-represented in anaerobes than 
in aerobes (Table 3, Supplementary Table S1 2, and 
Supplementary Fig. SI). On the other hand, mirror 
repeats cm10g50 and mirs12g20 are more likely 
to be over- re presented in the coding regions of anae- 
robes than aerobes (Supplementary Table S1 3 and 
Supplementary Fig. S2). 

3.5. Guanine-rich patterns and G-DNA-promoting 
sequences 

Formation of intrastrand G-DNA quadruplex is pro- 
moted by clustered short runs of guanine. 1 7-1 9 We 
investigated several forms of such G-rich clusters 



(G-patterns; Table 1 ), among which the GGG4g6 pattern 
(four G-triplets separated by gaps of no more than six 
nucleotides) represents best the sequences known to 
form G-quadruplexes. 2,38,39 In general, the G-patterns 
are normally represented in prokaryotic genomes 
(Table 2 and Supplementary Table S9). GG dimerclus- 
ters (GG8g4,eightor more GGdimers separated by <4 
nucleotides from each other) tend to be moderately 
over-represented in protein-coding regions of a-, (3-, 
and 8-proteobacteria, Actinobacteria, Cyanobacteria, 
Deincoccus-Thermus group, and Planctomycetes (es- 
pecially for Isosphaera) (Supplementary Table S1 0). 
The other guanine patterns are generally normally 
represented with most notable exceptions among 
Cyanobacteria where species of Microcystis, Anabaena, 
and Nostoc feature extreme over-representation of all 
forms of G-patterns (Supplementary Table S2). The 
planctomycete Isosphaera pallida also has G-patterns 
strongly over-represented. With these few exceptions, 
the GGG4g6 pattern, which is most directly related to 
G-DNA formation, is mostly normally represented or 
slightly under-represented, suggesting that if G-DNA- 
promoting sequences have significant physiological 
roles in bacteria, such roles are probably limited to 
only a few species or genera. 

Most organisms with over- re presented GG8g4 
pattern are aerobes, whereas virtually no anaerobes 
have excess of GG8g4 (Supplementary Fig. S1). 
However, over-represented GG8g4 pattern could also 
reflect excess of glycine-rich segments (encoded by 
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the GGN codons) or proline-rich segments (encoded by 
CCN with GG dinucleotides in the complementary 
strand) in proteins, whereas the GGG4g6 does not 
exhibit a relationship with oxygen requirement. There 
is no apparent relationship between G-pattern repre- 
sentations and the OGT (Table 3 and Supplementary 
Table S1 2). 

3.6. Z-DNA-promoting patterns 

The left-handed Z-DNA conformation is most com- 
monly adopted by runs of alternating G-C and more 
generally by alternating purine-pyrimidine (RY) pat- 
terns. The RY patterns are often under-represented 
buttodifferentextent in different phyla— most strongly 
in a-, (3-, and 7-proteobacteria, Actinobacteria, 
Cyanobacteria, Chlorobi, Chloroflexi, and Planctomycetes 
(Table 2 and Supplementary Table S9). In contrast, the 
RY patterns are normally represented in Fusobacteria, 
Aquificales, and Thermotogae and even slightly over- 
represented in Chlamydiae. In general, the suppression 
of RY patterns is stronger in protein-coding regions, al- 
though several phyla exhibit significant RY pattern sup- 
pression in intergenic regions as well (Supplementary 
Tables S1 0 and S1 1). Interestingly, Treponema palli- 
dum and Treponema paraluiscuniculi exhibit a strong 
over-representation of all forms of RY patterns, 
whereas other spirochaetes, including other species 
of Treponema, have RY patterns normally represented 
or weakly under-represented. Similar extreme over- 
representation of RY patterns applies to Helicobacter 
fells and to a lesser extent to Helicobacter bizzozeronii 
but not to other Helicobacter species. The archaeon 
Thermofilum pendens also exhibits a strong over-represen- 
tation of RY patterns and several other genomes show a 
weaker RY-pattern over-representation (Supplementary 
Table S2). 

We used Pattern Locator and associated software 
tools 36,40 (http://www.cmbl.uga.edu/software/patloc. 
html) to investigate in detail the distribution of RY 
patterns in several specific genomes, including the 
above-mentioned genomes with over-represented RY 
patterns. We did not find any significant anomalies in 
the distribution of the RY patterns with respect to the 
origin or terminus of replication, with respect to the 3' 
ends of genes (stop codons), or any strong association 
with a particular class of genes. However, we noted 
increased numbers of RY patterns overlapping with 
start codons, such that the ATG/GTG start codons are 
embedded in RY patterns which sometimes extend 
several codons deep into the protein-coding region 
(Supplementary Fig. S3). This tendency appears to be 
widespread among prokaryotic genomes. However, 
comparison between ATG and GTG triplets that func- 
tion as start codons and those that are not translation 
start sites shows no significant difference in fractions 



of ATG/GTG triplets overlapping with extended RY pat- 
terns between the two groups (Supplementary Table 
S1 8). Thissuggeststhatthe increased numberof RY pat- 
terns overlapping with translation start sites is not dir- 
ectly related to the translation but rather a simple 
consequence of ATG and GTG themselves having the 
form of a short RY pattern (RYR), thus increasing the 
likelihood of find ing a longer RY pattern at the same site. 

Representations of RY patterns tend to increase 
with increasing OGT (Table 3 and Supplementary 
Table S1 2). For example, the pattern RY1 2 is under- 
represented in 60% of psychrophiles, 50% of meso- 
philes, ~25% of thermophiles but only a single 
hyperthermophilic genus, Thermaerobacter (Fig. 1; 
significant at P< 1 0" 6 ). 

Among all analysed sequence patterns, the RY patterns 
exhibit the most notable trend with respect to the level of 
oxygen requirement (Table 3 and Supplementary Table 
S1 2). Interestingly, facultative species exhibit the stron- 
gest suppression of RY patterns followed by aerobes, 
whereas RY pattern representations in anaerobes and 
microaerophilic organisms are close to normal (see 
Supplementary Fig. S1c for pattern RY1 2). The same 
trend was observed when the comparisons between 
aerobes and anaerobes were restricted to mesophiles, 
or bacteria (Supplementary Tables S1 5 and S1 7), indi- 
cating that the relationship between RY patterns and 
oxygen requirement cannot be attributed to increased 
numberof anaerobes among thermophiles or different 
representations of anaerobes and aerobes among bac- 
teria and archaea. 



3 . 7. Patterns contributing to DNA curvature 

Sequence patterns related to DNA curvature (the 
bend-patterns) range generally from normal to over- 
represented. The over-representation is most pronounced 
in e-proteobacteria, Fusobacteria, Deferribacteres, and 
Thermotogae (Table 2 and Supplementary Table S8). 
This trend applies to both protein-coding and in- 
tergenic regions, although differences among phyla 
are more prominent in protein-coding regions 
(Supplementary Tables S9 and S1 1). At the opposite 
extreme, Mycoplasma baemofelis, Mycoplasma penetrans, 
Cytophaga butchinsonii, and Pelagibacter sp. show mod- 
erate to strong suppression of DNA bends in their 
genomes (Supplementary Table S2). The genus 
Mycoplasma is particularly interesting, featuring species 
with extreme over-representation of DNA bends as well 
as strong under-representation. This is consistent with 
our previous report that various genome properties 
among Mycoplasma vary more than in other genera. 37 
In addition, C. hutcbinsonii and Pelagibacter sp. feature 
extreme suppression of bend-patterns in protein-coding 
regions but not in intergenic regions (Supplementary 
Tables S3 and S4). 
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With respect to OGT, intrinsic DNA bends tend to be 
moreover-represented in hyperthermophilesandther- 
mophilesthan in mesophilesand psychrophiles(Table3 
and Fig. 1 ).Surprisingly,the numberof genera with over- 
represented bend-patterns in protein-coding segments 
increases with increasing OGT in protein-coding 
regions but decreases for bend-patterns in non-coding 
regions (Supplementary Fig. S4). The same trend is 
shown in Supplementary Fig. S5, where genomes of ther- 
mophiles and hyperthermophiles tend to have more 
protein-coding bends than intergenic bends, whereas 
the opposite is true for most psych rophiles. Statistical sig- 
nificance ofthistrend has been confirmed bytheMann- 
Whitney (J-test (Supplementary Table S1 9). 

With respect to oxygen requirement, representation 
of DNA bending patterns tends to decrease with the in- 
creasing level of oxygen (Table 3 and Supplementary 
Table S1 2). Specifically, the bend-patterns tend to be 
over-represented in anaerobes and microaerophiles but 
not in aerobes. For example, the pattern bend60w1 00 
is over-represented in half of the anaerobes, but <20% 
of aerobes (Supplementary Fig. 1; significant at P< 
1 0~ 6 ). This trend holds when the data are restricted to 
mesophiles or thermophiles, suggesting that the trend 
with respect to oxygen requirement is independent of 
the trend with respect to OGT (Supplementary Tables 
S15andS16). 

4. Discussion 

4. 1 . Comparisons with previous work 

The work presented here differs from similar earlier 
surveys in scope (both the number of genomes used 
in the analysis and the number of different types of se- 
quence patterns investigated) as well as methodology. 
The most important methodological difference is in 
the null model used to assess whether a sequence 
pattern is anomalously represented in a given genome. 
Our null model takes into account the nearest-neigh- 
bour biases and codon usage propensities of all individ- 
ual genes and intergenic regions, which likely have 
separate underlyingcauses not related to potential func- 
tional significance of the investigated sequence patterns. 
In terms of our results, one significant difference relative 
to earlier work concerns representations of alternating 
purine-pyrimidine patterns. Bohlin et al. 3 reported 
that alternating RY patterns were generally suppressed 
across different phyla except (3-proteobactria, where 
they were mostly over- re presented, with a particularly 
strong surplus of RY patterns in Burkholderia. These 
authors used the i.i.d. model (independently drawn 
and independently distributed letters) as a benchmark. 
Usingourmore realistic null model, we found that p-pro- 
teobactria includingBwr/Jjo/der/aspeciessuppresstheRY 
patterns to the same extent as other bacteria (Table 2 



and Supplementary TablesS2 and S9). We therefore con- 
clude that the increased amount of RY patterns in 
|3-proteobactria arises from two opposite evolutionary 
constraints: biased codon and dinucleotide usages 
that favour RY-patterns and are specific for (3-proteo- 
bactria, which are partially offset by suppression of 
long RY-patterns, which is rather universal among bac- 
teria I genomes. Similarly, the over-representation of oli- 
gopurine/oligopyrimidine stretches reported by Bohlin 
etal. for some phyla arises from the biases atthe level of 
dinucleotide and codon usages, as we found the oligo- 
purine/oligopyrimidine stretches normally repre- 
sented or under-represented in all phyla (Table 2 and 
Supplementary Table S9). 

For SSRs, our results are consistent with previous 
works including our own survey of SSRs in prokaryotic 
genomes. 7 In particular, our present data confirm that 
SSR comprised tandem repeats of very short units 
(mono-, di-, and trinucleotides; patterns 1 n8, 2n5, and 
3n4) tend to be strongly suppressed in prokaryotic 
geno mes, wh i le re peats of Ion ge r u n its (tet ra n ucleot ides 
through 1 1-mers) tend to be normally represented or 
weakly over-represented (Table 2 and Supplementary 
TableS9).Thisstrongdifference points to likely function- 
al difference between tandem repeats of very short (1 - 
3 bp) and longer (4-1 1 bp) units. Our present results 
differfrom the earlier data in that we previously included 
repeats of tetra nucleotides in the same group as mono-, 
di, and trinucleotides, 7 while the present results indicate 
that tetranucleotide SSRs may be more accurately 
included in the same group as SSRs composed of penta- 
nucleotides and longer units, which are generally not 
suppressed. This result suggests that tandem repeats of 
tetra nucleotides and longer units may not have the 
same harmful effects as repeats of mono-, di-, and trinu- 
cleotides. The difference between tandem repeats of 
very short units (1-3 bp) and longer units (>4 bp) 
could also arise from properties of the methyl-directed 
repair pathway, which is very efficient in repairing heter- 
ologous loops of 1 -3 bp, but its efficiency dramatically 
drops as the length of the loop increases. 41,42 

Ladoukakis and Eyre-Walker 43 reported a small but 
significant excess of short inverted repeats (6-9 bp) 
in protein-coding sequences and proffered that the 
inverted repeats could arise by sequence-directed mu- 
tagenesis where an imperfect palindrome or inverted 
repeat converts to a perfect one due to tern plate switch- 
ing during replication. 1 5 Their result is consistent with 
that of Katz and Burge, 44 who found that native 
mRNA sequences in many bacterial genomes possess 
a significantly higher potential to form stable RNA sec- 
ondary structures than random sequences preserving 
the codon usage, dinucleotide frequencies, and the 
protein sequences encoded by the native mRNAs. Katz 
and Burge proposed that the excess of base pairing in 
mRNA molecules is due to selection for stable mRNAs. 
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We analysed occurrences of slightly larger palindromes 
than Ladoukakis and Eyre-Walker but with a similar 
result that palindromes and close inverted repeats are 
strongly over- re presented in both protein-coding and 
non-coding regions of genomes of most prokaryotic 
phyla. 

The excess of palindromes is weaker in the archaea 
and in the Aquificae than in other bacterial phyla. This 
result is similar to that of Katz and Burge,who reported 
a significant excess of mRNA secondary structures in 
most bacteria but few archaea, and not in Aquifex aeoli- 
cus (note that their study was based on a much smaller 
sample of genomes available at that ti me andA. aeolicus 
was a sole representative of Aquificae in their data set). 
The weaker over-representation of palindromes in 
archaea and Aquificae is related to general decrease in 
palindrome representations with increasing OGT 
(Fig. 1). The decrease in palindrome representations 
with increasingOGT might seem counterintuitive if se- 
lection for mRNA stability drives the excess of palin- 
dromes in mRNA sequences. One possible explanation 
is that higher temperatures prevent formation of 
stable mRNA structures even with an increased 
content of inverted repeats, thus making selection for 
inverted repeats moot, or perhaps higher temperature 
decreases the efficiency of palindrome conversion via 
sequence-directed mutagenesis. Yet another potential 
explanation for weaker over-representation of inverted 
repeats in thermophiles stems from the work by Paz 
etal., 45 who noted thatthermophilic mRNA sequences 
have increased purine-to-pyrimidine ratios compared 
with mesophiles, as well as excess of short oligopurine 
tracts. These authors proposed that purine-loading of 
mRNA sequences could increase their thermal stability. 
The resulting purine-pyrimidine imbalance could de- 
crease the base-pairing potential in the mRNA 
sequences and the purine bias could therefore contrib- 
ute to lowerexcessof palindromesand inverted repeats 
in thermophiles. 

4.2. Relationship with OGT and oxygen requirement 

Extreme temperature as well as presence of oxygen 
(via reactive oxygen species) cause damage to DNA 
and require cellular mechanisms to prevent or correct 
the damage in order to keep the cells viable. Such pro- 
tection can be enzymatic (e.g. detoxification pathways) 
but could also involve adaptations in DNA composition 
and structure.One particularly puzzlingquestion is how 
thermophiles prevent their DNA from denaturing. The 
simple explanation that DNA of thermophile is more 
GC-rich and thus more stable was mostly rejected as 
more data became available. 46,47 Kawashima et a/. 48 
proffered excess of RR and YY dinucleotides over RY 
and YRdi nucleotides as a characteristic of thermophiles 
but this has later been shown a poor indicator of 



thermophily. 47 We were therefore interested in 
finding whether any of the sequence patterns we ana- 
lysed could be related to differences in OGT and 
oxygen requirement, and possibly contribute to the 
p rotect io n of D N A f ro m t h e eff ects of ext re m e te m pe ra - 
ture and /or oxygen damage. 

One of our more surprising results concerns the rela- 
tionship of representations of intrinsic DNA bends with 
OGT. Bolshoy and coworkers 5,49 reported an excess of 
intrinsically bent DNA in intergenic regions of prokary- 
otic genomes, including segments containing putative 
promoters and transcription terminators. Notably, the 
amount of bent DNA in intergenic regions was signifi- 
cantly weaker in thermophiles than in mesophiles. In 
a separate work, Tolstorukov et al. 33 found intrinsic 
DNA bends distributed widely throughout bacterial 
genomes, including both intergenic and protein- 
coding regions, and they proposed that the intrinsic 
bends could play a role in compaction of the bacterial 
nucleoid. Ourpresentdatashowthat predicted intrinsic 
DNA bends are over-represented in many phyla in both 
protein-coding and non-coding segments. Interestingly, 
the protein-coding and non-coding regions exhibit op- 
posite trends with respect to OGT. For non-coding 
sequences, our results are consistent with the decrease 
in curved DNA in thermophiles observed by Kozobay- 
Avraham et al. 5 However, the representation of DNA 
bends in protein-coding sequences tends to increase 
with increasingOGT. 

We propose that the opposite trends in protein- 
coding and non-coding DNA segments reflect different 
roles of DNA bends. Intergenic bends are often asso- 
ciated with regulatory elements and could facilitate 
opening of the DNA double helix at the transcription 
initiation sites. 50,51 Decreased amount of intrinsic 
DNA bends in intergenic regions of thermophiles is con- 
sistent with experiments that showed that anomalous 
gel mobility characteristic of intrinsically curved DNA 
disappears at increased temperatures. 52 If sequence- 
directed intrinsic DNA bending is ineffective at high 
temperatures, then selection for intrinsic bends at regu- 
latory sites could be weak or absent. It is also possible 
that DNA bending is less important to transcription ini- 
tiation at high temperatures. Surprisingly, the excess 
content of intragenic bends, which are less likely to 
have direct regulatory functions but may contribute 
to establishing and maintaining the nucleoid structure, 
increases with increasing OGT. The latter result could 
possibly indicate that the intrinsic bends maintain 
their role in stabilizing the nucleoid structure even at 
high temperatures and that high OGT may require 
increased amount of intrinsic DNA bending to maintain 
adequate structural stability of the chromosome. 

In addition to palindromes and close inverted repeats 
discussed above, some direct repeat structures also 
feature decreasing representations with increasing 
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OGT (Table 3 and Fig. 1). We speculate that higher 
temperature might make the thermophile DNA more 
susceptible to illegitimate recombination, DNA poly- 
merase slippage, or other forms of mutations facilitated 
by short repeats, and that reducing the amount of 
repeats, both direct and inverted, could be a strategy 
to counteract increased recombination rates. Along 
these lines, thermophiles were reported to have lower 
mutation rates than mesophiles, possibly driven by se- 
lection to maintain thermostability of theencoded pro- 
teins. 53 Because close repeats are often sources of 
mutations, suppression of close repeats could be a 
part of the strategy to decrease overall mutation rates 
in thermophiles. 

Alternating RY patterns exhibit trends related to both 
OGT and oxygen requirement and these trends are 
independent of each other. The relationship with 
oxygen requirement is particularly intriguing. Facultative 
organisms suppress RY patterns more strongly than 
both aerobes and anaerobes. The main known structural 
effect of RY patterns isthat they can facilitate transitions to 
left-handed Z-DNA under favourable conditions. The B- 
to-Z transition in such sequences can be induced by tor- 
sional stress arising from processes that require DNA 
unwinding, such as transcription or replication. 28,31 The 
general suppression of RY patterns in prokaryotes suggests 
that B-to-Z DNAtransitions are generally undesirable. It is 
intriguing to speculate that Z-DNA could be more detri- 
mental to facultative organisms perhaps due to interfer- 
ence with effective regulation of a diverse ensemble of 
metabolic pathways related to the ability to grow in 
both aerobic and anaerobicconditions. However,the spe- 
cific mechanism of such interference between Z-DNA for- 
mation and transcriptional regulation is not clear. 

Overall, ourdata indicate potential new mechanisms 
how genome properties adapt to particular environ- 
ments. Most of these mechanisms are related to adap- 
tations to growth at high temperatures, which appear 
to be accompanied by reduction in overall repetitive- 
ness of the DNA sequence, reduced excess of palin- 
dromes, and reduced DNA curvature in regulatory 
regions accompanied by general increase in intrinsical- 
ly curved DNA in protein-coding sequences. 

Acknowledgements: We wish to thank Dr Barny 
Whitman and Hao Tong for stimulating suggestions, 
and Orion Mantione for his help in writing the program 
to predict DNA bends. 

Supplementary data: Supplementary data are 
available at www.dnaresearch.oxfordjournals.org. 

Funding 

This work is supported by the grant number DBI- 
09 502 66 from National Science Foundation. 



References 

1. Ussery, D., Soumpasis, D.M., Brunak, S., Staerfeldt, H.H., 
Worning, P. and Krogh, A. 2002, Bias of purine stretches 
in sequenced chromosomes, Comput. Chew., 26, 
531 -41 . 

2. Sinden, R.R. 1 994, DNA Structure and Function. Academic 
Press: San Diego. 

3. Bohlin, J., Hardy, S.P. and Ussery, D.W. 2009, Stretches of 
alternating pyrimidine/purines and purines are respect- 
ively linked with pathogenicity and growth temperature 
in prokaryotes, BMC Genomics, 10, 346. 

4. Rawal, P., Kummarasetti, V.B., Ravindran, J., et al. 2006, 
Genome-wide prediction of G4 DNA as regulatory 
motifs: role in Escherichia coli global regulation, Genome 
Res., 16,644-55. 

5. Kozobay-Avraham, L, Hosid, S. and Bolshoy, A. 2006, 
Involvement of DNA curvature in intergenic regions of 
prokaryotes, Nucleic Acids Res., 34, 2316-2 7. 

6. Mrazek, J. 201 0, Comparative analysis of sequence peri- 
odicity among prokaryotic genomes points to differences 
in nucleoid structure and a relationship to gene expres- 
sion,/. Bacteriol., 1 92, 3763-72. 

7. Mrazek, J., Guo, X.X. and Shah, A. 2007, Simple sequence 
repeats in prokaryotic genomes, Proc. Natl Acad. Sci. 
USA, 104, 8472-7. 

8. Thanbichler, M., Wang, S.C. and Shapiro, L. 2005,The bac- 
terial nucleoids highlyorganized and dynamic structure, 
}. Cell Biochem., 96, 506-21. 

9. Dillon, S.C. and Dorman, C.J. 2010, Bacterial nucleoid- 
associated proteins, nucleoid structure and gene expres- 
sion, Not Rev. Microbiol., 8, 1 85-95. 

1 0. Moxon, E.R., Rainey, P.B., Nowak, M.A. and Lenski, R.E. 
1 994, Adaptive evolution of highly mutable loci in patho- 
genic bacteria, Curr. Biol., 4, 24-33. 

1 1 . van der Woude, M.W. and Baumler, A.J. 2004, Phase and 
antigenic variation in bacteria, Clin. Microbiol. Rev., 1 7, 
581-61 1, table of contents. 

1 2. Brazda, V, Laister, R.C., Jagelska, E.B. and Arrowsmith, C. 
2011, Cruciform structures are a common DNA feature 
important for regulating biological processes, BMC Mol. 
Biol., 12, 33. 

1 3. Levinson,G. and Gutman,G.A. 1 987,Slipped-strand mis- 
pairings major mechanism for DNA sequence evolution, 
Mol. Biol. Evol., 4, 203-21. 

14. Dutra, B.E. and Lovett, ST. 2006, Cis and trans-acting 
effects on a mutational hotspot involving a replication 
template switch,/. Mol. Biol., 356, 300-1 1 . 

1 5. Lovett, ST. 2004, Encoded errors: mutations and rearran- 
gements mediated by misalignment at repetitive DNA 
sequences, Mol. Microbiol., 52, 1 243-53. 

1 6. Chuzhanova, N., Abeysinghe, S.S., Krawczak, M. and 
Cooper, D.N. 2003, Translocation and gross deletion 
breakpoints in human inherited disease and cancer II: po- 
tential involvement of repetitive sequence elements in 
secondary structure formation between DNA ends, 
Hum. Mutat, 22, 245-51. 

1 7. Burge,S., Parkinson, G.N., Hazel, P.,Todd,A.K. and Neidle,S. 
2006, Quadruplex DNA: sequence, topology and struc- 
ture, NucleicAcids Res., 34, 5402-1 5. 



No. 3] 



Y. Huang and J. Mrazek 



297 



1 8. Vorlickova, M., Bednafova, K., Kejnovska, I. and Kypr, J. 
2007, Intramolecularand intermolecularguaninequad- 
ruplexes of DNA in aqueous salt and ethanol solutions, 
Biopolymers, 86, 1-10. 

19. Shafer, R.H. and Smirnov, I. 2000, Biological aspects of 
DNA/RNA quadruplexes, Biopolymers, 56, 209-27. 

20. Schaeffer, C, Bardoni, B., Mandel, J.L, Ehresmann, B., 
Ehresmann, C. and Moine, H. 2001, The fragile X 
mental retardation protein binds specifically to its 
mRNAvia a purine quartet motif, EMBOJ., 20, 4803-1 3. 

21 . Pan, B., Shi, K. and Sundaralingam, M. 2006, Base-tetrad 
swapping results in dimerization of RNA quadruplexes: 
implications for formation of the i-motif RNA octaplex, 
Proc. Natl Acad. Sci. USA, 1 03, 3 1 3 0-4. 

22. Arthanari, H. and Bolton, RH. 2001 , Functional and dys- 
functional roles of quadruplex DNA in cells, Chem. Biol., 
8,221-30. 

23. Sundquist, W.I. and Heaphy, S. 1 993, Evidence for inter- 
strand quadruplex formation in the dimerization of 
human immunodeficiency virus 1 genomic RNA, Proc. 
Natl Acad. Sci. USA, 90, 3393-7. 

24. Belotserkovskii, B.R, Veselkov, A.G., Filippov, S.A., 
Dobrynin, V.N., Mirkin, S.M. and Frank-Kamenetskii, M.D. 
1 990, Formation of intramolecular triplex in homopur- 
ine-homopyrimidine mirror repeats with point substitu- 
tions, Nucleic Acids Res., 1 8, 662 1 -4. 

25. Rustighi, A., Tessari, M.A., Vascotto, F, Sgarra, R., 
Giancotti, V. and Manfioletti, G. 2002, A polypyrimi- 
dine/polypurine tract within the Hmga2 minimal pro- 
moter: a common feature of many growth-related 
genes, Biochemistry, 41 , 1 229-40. 

26. Zain, R. and SunJ.S. 2003, Do natural DNA triple-helical 
structures occur and function in vivo?, Cell Mol. Life Sci., 
60, 862-70. 

27. Rich, A., Nordheim, A. and Azorin, F. 1 983, Stabilization 
and detection of natural left-handed Z-DNA, J. Biomol. 
Struct. Dyn., 1, 1 -19. 

28. van Holde, K. and ZlatanovaJ. 1 994, Unusual DNA struc- 
tures, chromatin and transcription, Bioessays, 16, 59-68. 

29. Herbert, A. and Rich, A. 1 999, Left-handed Z-DNA: struc- 
ture and function, Genetica, 106, 37-47. 

30. Rich, A. and Zhang, S. 2003, Timeline: Z-DNA: the long 
road to biological function, Nat. Rev. Genet., 4, 566-72. 

31. Wang, G. and Vasquez, K.M. 2007, Z-DNA, an active 
element in the genome, Front Biosci., 12, 4424-38. 

32. Jain, A, Wang, G. and Vasquez, K.M. 2008, DNA triple 
helices: biological consequences and therapeutic poten- 
tial, Biochimie, 90, 1 1 1 7-30. 

33. Tolstorukov, M.Y., Virnik, K.M., Adhya, S. and Zhurkin, V.B. 
2005, A-tract clusters may facilitate DNA packaging in 
bacterial nucleoid, NucleicAcids Res., 33, 3907-1 8. 

34. Herzel, H., Weiss, O. and Trifonov, E.N. 1 998, Sequence 
periodicity in complete genomes of archaea suggests 
positive supercoilingj. Biomol. Struct. Dyn., 16, 341 -5. 

35. Pagani, I., Liolios, K.,Jansson,J., et al. 201 2, The Genomes 
OnLine Database (GOLD) v.4: status of genomic and 
metagenomic projects and their associated metadata, 
NucleicAcids Res., 40, D571 -9. 

36. Mrazek, J. andXie, S. 2006, Pattern locator: a new tool for 
finding local sequence patterns in genomic DNA 
sequences, Bioinformatics, 22, 3099-1 00. 



37. Mrazek,J. 2006, Analysis of distribution indicates diverse 
functions of simple sequence repeats in Mycoplasma 
genomes, Mol. Biol. Evol.,2i, 1 370-85. 

38. Lane, A.N., Chaires, J.B., Gray, R.D. and Trent, J.O. 2008, 
Stability and kinetics of G-quadruplex structures, Nucleic 
Acids Res. ,36, 5482-51 5. 

39. Vorlickova, M., Chladkova, J., Kejnovska, I., Fialova, M. and 
Kypr, J. 2005,Guaninetetraplextopologyof human telo- 
mere DNA is governed by the number of (TTAGGG) 
repeats, NucleicAcids Res. ,33, 5851 -60. 

40. Mrazek, J., Xie,S.,Guo,X. and Srivastava, A. 2008,AIMIE:a 
web-based environmentfordetection and interpretation 
of significant sequence motifs in prokaryotic genomes, 
Bioinformatics, 24, 1 041 -8. 

41 . Fang, W.,Wu,J.Y. and Su, M.J. 1 997, Methyl-directed repair 
of mismatched small heterologous sequences in cell 
extracts from Escherichia coli, ]. Biol. Chem., 272, 
22714-20. 

42. Parker, B.O. and Marinus, M.G. 1 992, Repair of DNA het- 
eroduplexes containing small heterologous sequences in 
Escherichia coli, Proc. Natl Acad. Sci. USA, 89, 1 730-4. 

43. Ladoukakis, E.D. and Eyre-Walker, A. 2008, The excess of 
small inverted repeats in prokaryotes, J. Mol. Evoi, 67, 
291 -300. 

44. Katz, L. and Burge, C.B. 2003, Widespread selection for 
local RNA secondary structure in coding regions of bac- 
terial genes, Genome Res., 1 3, 2042-51 . 

45. Paz, A, Mester, D., Baca, I., Nevo, E. and Korol, A. 2004, 
Adaptive role of increased frequency of polypurine 
tracts in mRNA sequences of thermophilic prokaryotes, 
Proc. Natl Acad. Sci. USA, 101,2951-6. 

46. Haney, P.J., BadgerJ.H., Buldak.G.L, Reich, C.I., Woese.C.R. 
and Olsen, G.J. 1 999, Thermal adaptation analyzed by 
comparison of protein sequences from mesophilic and 
extremely thermophilic Methanococcus species, Proc. 
Natl Acad. Sci. USA, 96,3578-83. 

47. Suhre, K. and Claverie, J.M. 2003, Genomic correlates of 
hyperthermostability, an update, /. Biol. Chem., 278, 
1 71 98-202. 

48. Kawashima,T.,Amano, N., Koike, H.,etal. 2000,Archaeal 
adaptation to higher temperatures revealed by genomic 
sequence of Thermoplasma volcanium, Proc. Natl Acad. 
Sci. USA, 97, 14257-62. 

49. Bolshoy,A. and Nevo, E. 2000, Ecologic genomics of DNA: 
upstream bending in prokaryotic promoters, Genome 
Res., 10, 1 1 85-93. 

50. Olivares-Zavaleta, N., Jauregui, R. and Merino, E. 2006, 
Genome analysis of Escherichia coli promoter sequences 
evidences that DNA static curvature plays a more import- 
ant role in gene transcription than has previously been 
anticipated, Genomics, 87, 329-37. 

51. Jauregui, R., Abreu-Goodger, C., Moreno-Hagelsieb, G., 
Collado-Vides, J. and Merino, E. 2003, Conservation of 
DNAcurvaturesignals in regulatory regions of prokaryot- 
ic genes, NucleicAcids Res., 31, 6770-7. 

52. Ussery, D.W., Higgins, C.F. and Bolshoy, A. 1 999, 
Environmental influences on DNA curvature,/. Biomol. 
Struct. Dyn., 16, 81 1 -23. 

53. Drake, J.W 2009, Avoiding dangerous missense:thermo- 
philes display especially low mutation rates, PLoS Genet, 
5,e1 000520. 



